Machine learning is an application of artificial intelligence (AI) that gives systems the ability to learn and improve from experience automatically, without being explicitly programmed. It focuses on developing computer programs that can access data and use it to learn for themselves.
Machine learning algorithms perform three broad types of tasks; this article is confined to classification.
Classification is the problem of identifying to which set of categories a new observation belongs. The goal is to find boundaries that best separate the different categories of data.
There are two types of classification:
1) Binary classification, used when there are only 2 classes to separate.
2) Multi-class classification, used when there are more than 2 classes to separate.
Today we tackle a multi-class classification problem with XGBoost.
A device has collected sensor data while moving over different floors; given new readings, it must identify the floor type on its own. We will use machine learning to build a prediction model that tells the device which floor it is standing on from the measured properties of that floor.
Libraries used:
library(data.table)
library(ggplot2)
library(plotly)
library(dplyr)
library(corrplot)
library(kableExtra)
library(mltools)
library(caTools)
library(xgboost)
library(Ckmeans.1d.dp)
library(Matrix)
library(readr)
library(car)
library(caret)
library(lattice)
#The package for random forest function
library(MASS)
library(randomForest)
Importing all the data sets into R. We drop the row_id and group_id columns: their values are unique identifiers, so they are of no use for our analysis.
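The import step itself is not shown above; a minimal sketch (the CSV file names are assumptions, not taken from the article) would read the files with fread and then drop the identifier columns by reference, which data.table does without copying:

```r
library(data.table)

# Assumed file names -- adjust to wherever the competition CSVs live:
# x_train <- fread("X_train.csv")
# y_train <- fread("y_train.csv")
# test    <- fread("X_test.csv")

# Dropping identifier columns by reference, shown on a toy table:
dt <- data.table(row_id = 1:3, group_id = c(7L, 7L, 8L), value = c(0.1, 0.2, 0.3))
dt[, c("row_id", "group_id") := NULL]  # deletes the columns in place
names(dt)
# [1] "value"
```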
Viewing the structure of the training data (features and target).
str(x_train)
Classes ‘data.table’ and 'data.frame': 487680 obs. of 12 variables:
$ series_id : int 0 0 0 0 0 0 0 0 0 0 ...
$ measurement_number : int 0 1 2 3 4 5 6 7 8 9 ...
$ orientation_X : num -0.759 -0.759 -0.759 -0.759 -0.759 ...
$ orientation_Y : num -0.634 -0.634 -0.634 -0.634 -0.634 ...
$ orientation_Z : num -0.105 -0.105 -0.105 -0.105 -0.105 ...
$ orientation_W : num -0.106 -0.106 -0.106 -0.106 -0.106 ...
$ angular_velocity_X : num 0.10765 0.06785 0.00727 -0.01305 0.00513 ...
$ angular_velocity_Y : num 0.01756 0.02994 0.02893 0.01945 0.00765 ...
$ angular_velocity_Z : num 0.000767 0.003386 -0.005978 -0.008974 0.005245 ...
$ linear_acceleration_X: num -0.749 0.34 -0.264 0.427 -0.51 ...
$ linear_acceleration_Y: num 2.1 1.51 1.59 1.1 1.47 ...
$ linear_acceleration_Z: num -9.75 -9.41 -8.73 -10.1 -10.44 ...
- attr(*, ".internal.selfref")=<externalptr>
str(y_train)
Classes ‘data.table’ and 'data.frame': 3810 obs. of 2 variables:
$ series_id: int 0 1 2 3 4 5 6 7 8 9 ...
$ surface : Factor w/ 9 levels "carpet","concrete",..: 3 2 2 2 7 8 6 2 5 8 ...
- attr(*, ".internal.selfref")=<externalptr>
Now we are going to join x_train and y_train by series_id.
train_data<-merge(x_train,y_train,by="series_id")
Viewing the column names.
colnames(train_data)
[1] "series_id" "measurement_number" "orientation_X" "orientation_Y" "orientation_Z"
[6] "orientation_W" "angular_velocity_X" "angular_velocity_Y" "angular_velocity_Z" "linear_acceleration_X"
[11] "linear_acceleration_Y" "linear_acceleration_Z" "surface"
These are our features: surface is the target variable, and all the others act as predictors in our model.
Let's check how many types of floor we have.
table(train_data$surface)
carpet concrete fine_concrete hard_tiles hard_tiles_large_space soft_pvc
24192 99712 46464 2688 39424 93696
soft_tiles tiled wood
38016 65792 77696
So we have 9 types of floor, which makes this a multi-class classification problem.
Summary of the data.
summary(train_data)
series_id measurement_number orientation_X orientation_Y orientation_Z orientation_W angular_velocity_X
Min. : 0 Min. : 0.00 Min. :-0.98910 Min. :-0.98965 Min. :-0.16283 Min. :-0.156620 Min. :-2.3710000
1st Qu.: 952 1st Qu.: 31.75 1st Qu.:-0.70512 1st Qu.:-0.68898 1st Qu.:-0.08947 1st Qu.:-0.106060 1st Qu.:-0.0407520
Median :1904 Median : 63.50 Median :-0.10596 Median : 0.23786 Median : 0.03195 Median :-0.018704 Median : 0.0000842
Mean :1904 Mean : 63.50 Mean :-0.01805 Mean : 0.07506 Mean : 0.01246 Mean :-0.003804 Mean : 0.0001775
3rd Qu.:2857 3rd Qu.: 95.25 3rd Qu.: 0.65180 3rd Qu.: 0.80955 3rd Qu.: 0.12287 3rd Qu.: 0.097215 3rd Qu.: 0.0405272
Max. :3809 Max. :127.00 Max. : 0.98910 Max. : 0.98898 Max. : 0.15571 Max. : 0.154770 Max. : 2.2822000
angular_velocity_Y angular_velocity_Z linear_acceleration_X linear_acceleration_Y linear_acceleration_Z surface
Min. :-0.927860 Min. :-1.268800 Min. :-36.0670 Min. :-121.490 Min. :-75.386 concrete :99712
1st Qu.:-0.033191 1st Qu.:-0.090743 1st Qu.: -0.5308 1st Qu.: 1.958 1st Qu.:-10.193 soft_pvc :93696
Median : 0.005412 Median :-0.005335 Median : 0.1250 Median : 2.880 Median : -9.365 wood :77696
Mean : 0.008338 Mean :-0.019184 Mean : 0.1293 Mean : 2.886 Mean : -9.365 tiled :65792
3rd Qu.: 0.048068 3rd Qu.: 0.064604 3rd Qu.: 0.7923 3rd Qu.: 3.799 3rd Qu.: -8.523 fine_concrete :46464
Max. : 1.079100 Max. : 1.387300 Max. : 36.7970 Max. : 73.008 Max. : 65.839 hard_tiles_large_space:39424
(Other) :64896
Now checking for NAs and empty strings in the data.
cat("\nThe Total number of NA 's in the train data is :- ",sum(is.na(train_data)))
The Total number of NA 's in the train data is :- 0
cat("\nThe Total number of empty spaces 's in the train data is :- ",sum(train_data==" "))
The Total number of empty spaces 's in the train data is :- 0
Checking the percentage of each class in the surface feature (the target variable), to see whether the classes are imbalanced.
# Donut chart displaying the percentage of each class in the column
p <- train_data %>%
group_by(surface) %>%
summarize(count = n()) %>%
plot_ly(labels = ~surface, values = ~count) %>%
add_pie(hole = 0.6) %>%
layout(title = "The Percentage of category at Surface column", showlegend = F,
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
The classes are reasonably well represented (hard_tiles is the only comparatively rare category), so we proceed without any resampling.
Encoding the target variable.
train_data$surface <-as.numeric(as.factor(train_data$surface))-1
Actual Values            Encoded Values
carpet                   0
concrete                 1
fine_concrete            2
hard_tiles               3
hard_tiles_large_space   4
soft_pvc                 5
soft_tiles               6
tiled                    7
wood                     8
corr<-cor(x_train)
corrplot(corr,type = "lower")
There is high positive correlation between orientation_W and orientation_X and between orientation_Z and orientation_Y, and high negative correlation between angular_velocity_Y and angular_velocity_Z.
Since we are using XGBoost to build the model, we do not need to treat this multicollinearity explicitly; tree-based models are largely insensitive to correlated predictors.
Dividing the data set into train and test splits.
## 75% of the sample size
smp_size <- floor(0.75 * nrow(train_data))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(train_data)), size = smp_size)
train_train_data <- train_data[train_ind,]
test_train_data <- train_data[-train_ind,]
Now checking the class balance in the training split.
# Donut chart displaying the percentage of each class in the column
p <- train_train_data %>%
group_by(surface) %>%
summarize(count = n()) %>%
plot_ly(labels = ~surface, values = ~count) %>%
add_pie(hole = 0.6) %>%
layout(title = "The Percentage of category at Surface column", showlegend = F,
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
The class proportions in the training split mirror the full data, so we now use train_train_data for model building.
Now applying the XGBoost algorithm for classification.
Preparing the data for XGBoost. Note: XGBoost cannot handle categorical (character or factor) data; every input must be numeric, and the data must always be passed as a matrix (here wrapped in an xgb.DMatrix).
dtrain <- xgb.DMatrix(as.matrix(train_train_data[,-"surface"]), label = as.matrix(train_train_data$surface))
dtestfinal<-xgb.DMatrix(as.matrix(test_train_data[,-"surface"]), label = as.matrix(test_train_data$surface))
Setting the parameters for XGBoost.
#default parameters
params <- list(booster = "gbtree",num_class=9 ,objective = "multi:softmax", eta=0.2, max_depth=4, min_child_weight=2, subsample=1, colsample_bytree=1)
Tuning the number of boosting rounds with cross-validation.
# find the best nrounds
cv<-xgb.cv( params = params, data = dtrain, nrounds = 50, nfold = 5,gamma=0, showsd = T, stratified = T, print.every.n = 10, early.stop.round = 20, maximize = F)
'print.every.n' is deprecated.
Use 'print_every_n' instead.
See help("Deprecated") and help("xgboost-deprecated").'early.stop.round' is deprecated.
Use 'early_stopping_rounds' instead.
See help("Deprecated") and help("xgboost-deprecated").
[1] train-merror:0.351930+0.002603 test-merror:0.352816+0.003646
Multiple eval metrics are present. Will use test_merror for early stopping.
Will train until test_merror hasn't improved in 20 rounds.
[11] train-merror:0.214109+0.003833 test-merror:0.215639+0.005167
[21] train-merror:0.172163+0.002382 test-merror:0.174437+0.003537
[31] train-merror:0.142943+0.002307 test-merror:0.145601+0.002569
[41] train-merror:0.124097+0.001695 test-merror:0.127124+0.001975
[50] train-merror:0.111857+0.000658 test-merror:0.115286+0.000921
cv$best_iteration
[1] 50
print(cv, verbose=TRUE)
##### xgb.cv 5-folds
call:
xgb.cv(params = params, data = dtrain, nrounds = 50, nfold = 5,
showsd = T, stratified = T, maximize = F, gamma = 0, print.every.n = 10,
early.stop.round = 20)
params (as set within xgb.cv):
booster = "gbtree", num_class = "9", objective = "multi:softmax", eta = "0.2", max_depth = "4", min_child_weight = "2", subsample = "1", colsample_bytree = "1", gamma = "0", print_every_n = "10", early_stop_round = "20", silent = "1"
callbacks:
cb.print.evaluation(period = print_every_n, showsd = showsd)
cb.evaluation.log()
cb.early.stop(stopping_rounds = early_stopping_rounds, maximize = maximize,
verbose = verbose)
niter: 50
best_iteration: 50
best_ntreelimit: 50
evaluation_log:
Best iteration:
As the output above shows, the minimum test error was reached at iteration 50.
Now training the model on train data set.
#first default - model training
xgb1 <- xgb.train (params = params, data = dtrain, nrounds = 50, watchlist = list(val=dtestfinal,train=dtrain), print.every.n = 10, early.stop.round = 10, maximize = F , eval_metric = "mlogloss")
'print.every.n' is deprecated.
Use 'print_every_n' instead.
See help("Deprecated") and help("xgboost-deprecated").'early.stop.round' is deprecated.
Use 'early_stopping_rounds' instead.
See help("Deprecated") and help("xgboost-deprecated").
[1] val-mlogloss:1.855254 train-mlogloss:1.854063
Multiple eval metrics are present. Will use train_mlogloss for early stopping.
Will train until train_mlogloss hasn't improved in 10 rounds.
[11] val-mlogloss:0.931260 train-mlogloss:0.926395
[21] val-mlogloss:0.676190 train-mlogloss:0.670568
[31] val-mlogloss:0.547322 train-mlogloss:0.541441
[41] val-mlogloss:0.469425 train-mlogloss:0.463405
[50] val-mlogloss:0.417579 train-mlogloss:0.411473
Now we evaluate the model on the held-out test split.
#model prediction
xgbpred <- predict (xgb1,dtestfinal)
Checking the results of the model.
result<-confusionMatrix(table(as.factor(test_train_data$surface), as.factor(xgbpred)),mode = "prec_recall")
result
Confusion Matrix and Statistics
0 1 2 3 4 5 6 7 8
0 4698 646 8 0 0 0 131 107 455
1 28 22376 383 0 333 453 218 351 835
2 0 496 10035 24 84 205 43 466 382
3 0 0 4 687 0 0 38 3 0
4 23 871 180 0 8309 74 0 29 246
5 253 627 222 0 69 20825 473 316 419
6 81 162 19 42 0 340 8906 153 32
7 77 191 249 36 22 529 134 15059 48
8 144 1256 389 0 231 499 37 14 16845
Overall Statistics
Accuracy : 0.8837
95% CI : (0.8819, 0.8855)
No Information Rate : 0.2184
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8636
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Precision 0.77717 0.8959 0.85513 0.938525 0.85378 0.8975 0.91484 0.9213 0.8676
Recall 0.88575 0.8404 0.87344 0.870722 0.91832 0.9084 0.89238 0.9128 0.8745
F1 0.82791 0.8673 0.86419 0.903353 0.88488 0.9029 0.90347 0.9170 0.8711
Prevalence 0.04350 0.2184 0.09423 0.006471 0.07421 0.1880 0.08186 0.1353 0.1580
Detection Rate 0.03853 0.1835 0.08231 0.005635 0.06815 0.1708 0.07305 0.1235 0.1382
Detection Prevalence 0.04958 0.2049 0.09625 0.006004 0.07982 0.1903 0.07985 0.1341 0.1592
Balanced Accuracy 0.93710 0.9066 0.92902 0.935175 0.95286 0.9422 0.94249 0.9503 0.9247
Our model achieves an accuracy of about 88.4% on the held-out data.
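As a sanity check on how that headline number is derived: overall accuracy is simply the sum of the confusion matrix diagonal (the correct predictions) divided by the grand total. A toy sketch with made-up counts:

```r
# Accuracy from a confusion matrix: diagonal sum over the grand total.
# The counts below are illustrative, not taken from the model above.
cm <- matrix(c(40,  5,
                7, 48), nrow = 2, byrow = TRUE)
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 88 correct out of 100 observations -> 0.88
```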
The feature importances reported by XGBoost:
xgb.importance( model =xgb1)
As shown above, orientation_X is the most important feature for predicting the floor type.
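The Ckmeans.1d.dp package loaded at the start exists to support XGBoost's importance plot. A self-contained sketch on the agaricus example data bundled with the xgboost package (a stand-in for xgb1 and our sensor features, not the article's actual model):

```r
library(xgboost)

# Train a tiny model on xgboost's bundled example data.
data(agaricus.train, package = "xgboost")
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               nrounds = 5, objective = "binary:logistic", verbose = 0)

imp <- xgb.importance(model = bst)   # data.table with Feature, Gain, Cover, Frequency
xgb.plot.importance(imp, top_n = 10) # bar chart of the top features by Gain
```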
Now we use the model to predict on the actual test data, first preprocessing it in the same way.
dtestfinal<-xgb.DMatrix(as.matrix(test))
Predicting the values for the test data.
predicted_value<-predict (xgb1,dtestfinal)
class(predicted_value)
[1] "numeric"
A function for decoding the numeric labels back to their surface names.
Decoding<-function(x){
if(x==0){
return("carpet")
}else if(x==1){
return("concrete")
}else if(x==2){
return("fine_concrete")
}else if(x==3){
return("hard_tiles")
}else if(x==4){
return("hard_tiles_large_space")
}else if(x==5){
return("soft_pvc")
}else if(x==6){
return("soft_tiles")
}else if(x==7){
return("tiled")
}else{
return("wood")
}
}
carpet 0, concrete 1, fine_concrete 2, hard_tiles 3, hard_tiles_large_space 4, soft_pvc 5, soft_tiles 6, tiled 7, wood 8
Now we decode the predicted values back to their respective categories, as listed above.
predicted_value<-sapply(predicted_value,Decoding)
The final predictions are:
test$surface<-as.factor(predicted_value)
Viewing the test data after attaching the predictions gives the expected output.
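The write-up stops at attaching the predictions to the test data. If the goal is a submission file, a minimal sketch might look like the following; the file name, the column names, and the placeholder predictions are assumptions, not taken from the article:

```r
# Sketch of writing predictions to a CSV (names and layout are assumptions).
predicted_value <- c("concrete", "wood", "carpet")  # placeholder predictions

submission <- data.frame(series_id = seq_along(predicted_value) - 1,
                         surface   = predicted_value)
write.csv(submission, "submission.csv", row.names = FALSE)
```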